Extra Credit Exercises (optional)

If you have successfully completed the material for Linear Regression with One Variable, congratulations! You now understand linear regression and should be able to start using it on your own datasets.

For the rest of this programming exercise, we have included the following optional extra credit exercises. These exercises will help you gain a deeper understanding of the material, and we encourage you to complete them if you can.

Linear Regression with Multiple Variables

In this part, you will implement linear regression with multiple variables to predict the prices of houses. Suppose you are selling your house and you want to know what a good market price would be. One way to do this is to first collect information on recent houses sold and make a model of housing prices.


In [ ]:
# Import numpy for linear algebra and numerical computing functions, and matplotlib for plotting graphs
import numpy as np
from numpy import ones, zeros, newaxis, r_, c_, mat, dot
from numpy.linalg import pinv
import matplotlib.pyplot as plt

import ipywidgets as widgets   # interactive widgets (slider) used later in this notebook
from ipywidgets import interact
from IPython.display import display

# Enable matplotlib inline plotting for this notebook
%matplotlib inline

The file ex1data2.txt contains a training set of housing prices in Portland, Oregon. The first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house.


In [ ]:
data = np.loadtxt('../data/ex1data2.txt', delimiter=',')
unormalized_X = mat(data[:, :2]) # training example inputs
y = c_[data[:, 2]]   # training example outputs
m = unormalized_X.shape[0]

data[:5] # First 5 rows of training examples (just for viewing)

Feature Normalization

The previous cell loaded this dataset and displayed its first few rows. By looking at the values, note that house sizes are about 1000 times the number of bedrooms. When features differ by orders of magnitude, performing feature scaling first can make gradient descent converge much more quickly.

Your task here is to complete the code in feature_normalize() to

  • Subtract the mean value of each feature from the dataset.
  • After subtracting the mean, additionally scale (divide) the feature values by their respective “standard deviations.”

The standard deviation is a way of measuring how much variation there is in the range of values of a particular feature (most data points will lie within $\pm$2 standard deviations of the mean); this is an alternative to taking the range of values (max-min).

In Numpy, you can use the “std” function to compute the standard deviation. For example, inside feature_normalize(), the quantity X[:, 0] contains all the values of $x_1$ (house sizes) in the training set, so std(X[:, 0]) computes the standard deviation of the house sizes. At the time that feature_normalize() is called, the extra column of 1’s corresponding to $x_0$ = 1 has not yet been added to X.

You will do this for all the features and your code should work with datasets of all sizes (any number of features / examples). Note that each column of the matrix X corresponds to one feature.

Implementation Note: When normalizing the features, it is important to store the values used for normalization - the mean value and the standard deviation used for the computations. After learning the parameters from the model, we often want to predict the prices of houses we have not seen before. Given a new x value (living room area and number of bedrooms), we must first normalize x using the mean and standard deviation that we had previously computed from the training set.
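
As a quick illustration of the hint above, np.mean and np.std with axis=0 compute per-column (per-feature) statistics, and broadcasting then normalizes every column at once. The small array below is made up purely for illustration; it is not part of the exercise data.


In [ ]:
# Illustration only: per-column normalization of a small made-up array
toy = np.array([[1000.0, 2.0],
                [2000.0, 3.0],
                [3000.0, 4.0]])
toy_mu = toy.mean(axis=0)       # mean of each column (feature)
toy_sigma = toy.std(axis=0)     # standard deviation of each column
(toy - toy_mu) / toy_sigma      # broadcasting subtracts/divides column-wise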


In [ ]:
def feature_normalize(X):
    """Normalizes the features in X
    returns a normalized version of X where the mean value of each feature
    is 0 and the standard deviation is 1. This is often a good preprocessing
    step to do when working with learning algorithms.
    """
    
    # You need to set these values correctly
    X_norm = X.copy().A
    mu = zeros(X.shape[1])
    sigma = zeros(X.shape[1])
 
    ### YOUR CODE HERE ###
    # Instructions: First, for each feature dimension, compute the mean
    #               of the feature and subtract it from the dataset,
    #               storing the mean value in mu. Next, compute the
    #               standard deviation of each feature and divide
    #               each feature by its standard deviation, storing
    #               the standard deviation in sigma.
    #
    #               Note that X is a matrix where each column is a
    #               feature and each row is an example. You need
    #               to perform the normalization separately for
    #               each feature.
    #
    # Hint: You might find the 'mean' and 'std' functions useful.
    #
 
    return mat(X_norm), mu, sigma

In [ ]:
# Normalize the input matrix X
# Note: feature_normalize is called again in later cells on the same
# unnormalized data, so mu and sigma should come out the same every time;
# they are printed here so you can sanity-check your implementation.
X, mu, sigma = feature_normalize(unormalized_X)
print("mu = ", mu)
print("sigma = ", sigma)

Gradient Descent

Previously, you implemented gradient descent on a univariate regression problem. The only difference now is that there is one more feature in the matrix X. The hypothesis function and the batch gradient descent update rule remain unchanged.

You should complete the code in compute_cost() and gradient_descent() to implement the cost function and gradient descent for linear regression with multiple variables. If your code in the previous part (single variable) already supports multiple variables, you can use it here too.

Make sure your code supports any number of features and is well-vectorized. You can use X.shape[1] to find out how many features are present in the dataset.

Implementation Note: In the multivariate case, the cost function can also be written in the following vectorized form:

$$J(\theta) = \frac{1}{2m} (X\theta - \vec{y})^T (X\theta - \vec{y})$$

where

$$ X = \begin{bmatrix} - (x^{(1)})^T - \\ - (x^{(2)})^T - \\ \vdots \\ - (x^{(m)})^T - \end{bmatrix} \qquad \vec{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} $$

The vectorized version is efficient when you’re working with numerical computing tools like Numpy. If you are an expert with matrix operations, you can prove to yourself that the two forms are equivalent.
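
As a point of reference, the vectorized form above translates almost directly into Numpy. The sketch below is one possible way to compute the cost, not the only valid implementation; it assumes X already includes the column of ones, and the name compute_cost_sketch is just for illustration.


In [ ]:
# Reference sketch of the vectorized cost (assumes X includes the intercept column)
def compute_cost_sketch(X, y, theta):
    m = X.shape[0]
    residual = np.asarray(X.dot(theta) - y)               # (m, 1) vector of errors
    return (residual.T.dot(residual) / (2 * m)).item()    # scalar value of J(theta)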


In [ ]:
def compute_cost(X, y, theta):
    """Compute cost for linear regression with multiple variables
    computes the cost of using theta as the parameter
    for linear regression to fit the data points X and y
    """
    # Initialize some useful variables
    m = X.shape[0] # number of training examples

    # You need to return the following variables correctly
    J = 0

    ### YOUR CODE HERE ###
    # Instructions: Compute the cost of a particular choice of theta
    # You should set J to the cost.
    
    return J

In [ ]:
def gradient_descent(X, y, theta, alpha, num_iters):
    """Performs gradient descent to learn theta
    updates theta by taking num_iters gradient steps
    with learning rate alpha
    """
    # Initialize some useful values
    m = X.shape[0] # number of training examples
    J_history = zeros(num_iters)
    
    for i in range(num_iters):
        ### YOUR CODE HERE ###
        # Instructions: Perform a single gradient step on the parameter vector
        #               theta. 
        #
        # Hint: While debugging, it can be useful to print out the values
    #       of the cost function (compute_cost) and the gradient here.
        #
        
        # Save the cost J in every iteration
        J_history[i] = compute_cost(X, y, theta)
        
    return theta, J_history
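
For reference, one common vectorized way to write the update inside the loop is sketched below. It assumes X is the m x (n+1) design matrix including the intercept column, y is m x 1, and theta is (n+1) x 1; the helper name gradient_step_sketch is just for illustration and is not part of the exercise.


In [ ]:
# Reference sketch of a single batch gradient descent step:
#     theta := theta - (alpha / m) * X^T (X theta - y)
def gradient_step_sketch(X, y, theta, alpha):
    m = X.shape[0]
    gradient = X.T.dot(X.dot(theta) - y) / m   # (n+1) x 1 gradient of J(theta)
    return theta - alpha * gradient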

In [ ]:
X, mu, sigma = feature_normalize(unormalized_X) # normalize features to zero mean, unit standard deviation
X = c_[np.ones(m), X] # Add a column of ones to X for the theta_0 intercept term

theta = c_[np.zeros(3)] # initialize fitting parameters to 0

iterations = 400
alpha = 0.01

theta, J_history = gradient_descent(mat(X), y, theta, alpha, iterations)

print('Theta found by gradient descent:\n', theta)

Optional (ungraded) exercise: Selecting learning rates

In this part of the exercise, you will get to try out different learning rates for the dataset and find a learning rate that converges quickly. The next cell provides an interactive slider so you can vary the learning rate and immediately see its effect.

The cell calls your gradient_descent() function and runs gradient descent for a fixed number of iterations at the chosen learning rate (100 iterations below, with the plot limited to the first 40). The function also returns the history of $J(\theta)$ values in the vector J_history, and the cell plots these values against the iteration number.

If you picked a learning rate within a good range, your plot should decrease quickly and level off at around 25 iterations. If your graph looks very different, especially if your value of $J(\theta)$ increases or even blows up, adjust your learning rate and try again. We recommend trying values of the learning rate $\alpha$ on a log scale, at multiplicative steps of about 3 times the previous value (i.e., 0.3, 0.1, 0.03, 0.01, and so on).

You may also want to adjust the number of iterations you are running if that will help you see the overall trend in the curve.


In [ ]:
X, mu, sigma = feature_normalize(unormalized_X) # Normalize features to zero mean, unit standard deviation
X = c_[np.ones(m), X] # Add a column of ones to X for the theta_0 intercept term
 
iterations = 100

def plot_convergence(alpha):
    theta = c_[np.zeros(3)]
    theta, J_history = gradient_descent(X, y, theta, alpha, iterations)
    
    # Plot the convergence graph
    plt.plot(r_[:iterations], J_history, linewidth=1)
    plt.xlim(0, 40)
    plt.xlabel('Number of iterations')
    plt.ylabel('Cost J')
interact(plot_convergence, alpha=widgets.FloatSlider(min=0, max=1.29, step=0.03, value=0.5))

Implementation Note: If your learning rate is too large, $J(\theta)$ can diverge and ‘blow up’, resulting in values that are too large for floating-point calculations.

Python Tip: To compare how different learning rates affect convergence, it’s helpful to plot J for several learning rates on the same figure. In Matplotlib, successive calls to plt.plot in the same cell draw onto the same figure, so there is no need for an explicit ‘hold on’ as in MATLAB. Concretely, if you’ve tried three different values of $\alpha$ (you should probably try more values than this) and stored the cost histories in J1, J2 and J3, you can use the following commands to plot them on the same figure:

plt.plot(r_[:50], J1[:50], 'b')
plt.plot(r_[:50], J2[:50], 'r')
plt.plot(r_[:50], J3[:50], 'k')

The final arguments ‘b’, ‘r’, and ‘k’ specify different colors for the plots.

Notice the changes in the convergence curves as the learning rate changes. With a small learning rate, you should find that gradient descent takes a very long time to converge to the optimal value. Conversely, with a large learning rate, gradient descent might not converge or might even diverge!
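
Putting the tip above into practice, the sketch below reruns gradient descent for a few candidate learning rates and overlays the convergence curves on a single figure. The specific alpha values are only examples; the code relies on X, y, and gradient_descent as defined earlier in this notebook.


In [ ]:
# Sketch: compare convergence for several candidate learning rates on one figure
candidate_alphas = [0.3, 0.1, 0.03, 0.01]   # example values, roughly a factor of 3 apart
for a in candidate_alphas:
    _, J_hist = gradient_descent(X, y, c_[np.zeros(3)], a, 50)
    plt.plot(r_[:50], J_hist, linewidth=1, label='alpha = {}'.format(a))
plt.xlabel('Number of iterations')
plt.ylabel('Cost J')
plt.legend()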

Using the best learning rate that you found, rerun gradient descent until convergence to find the final value of $\theta$. Then run the next cell, which uses this $\theta$ to predict the price of a house with 1650 square feet and 3 bedrooms. You will use this value later to check your implementation of the normal equations. Don’t forget to normalize your features when you make this prediction!


In [ ]:
house = [1650, 3]                # 1650 sq-ft, 3 bedrooms
house = (house - mu) / sigma     # normalize using the mu and sigma computed from the training set
price = r_[1, house].dot(theta)  # prepend the intercept term x_0 = 1 before applying theta

price = price.A[0][0] # If price is a matrix, pull out the first value
print('For a house with 1650 sq-ft and 3 bedrooms, we predict a price of ${:.2f}'.format(price))

Normal Equations

In the lecture videos, you learned that the closed-form solution to linear regression is

$$\theta = \left(X^TX\right)^{-1} X^T\vec{y}$$

Using this formula does not require any feature scaling, and you will get an exact solution in one calculation: there is no “loop until convergence” like in gradient descent.

Complete the function normal_eqn() to use the formula above to calculate $\theta$. Remember that while you don’t need to scale your features, you still need to add a column of 1’s to the X matrix to have an intercept term $\theta_0$.


In [ ]:
def normal_eqn(X, y):
    """Computes the closed-form solution to linear regression
    computes the closed-form solution to linear regression using the normal equations.
    """
    theta = c_[np.zeros(X.shape[1])]  # works for any number of features
    
    ### YOUR CODE HERE ###
    # Instructions: Complete the code to compute the closed form solution
    #               to linear regression and put the result in theta.
    #
    
    return theta
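
For reference, the closed-form formula above maps directly onto pinv from numpy.linalg (already imported at the top of this notebook). The sketch below shows one possible implementation; the name normal_eqn_sketch is just for illustration, and it assumes X already includes the column of ones.


In [ ]:
# Reference sketch: theta = pinv(X^T X) X^T y
def normal_eqn_sketch(X, y):
    X = np.asarray(X)                          # work with a plain ndarray
    return pinv(X.T.dot(X)).dot(X.T).dot(y)    # closed-form least-squares solution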

Optional (ungraded) exercise: Now, once you have found $\theta$ using this method, use it to make a price prediction for a 1650-square-foot house with 3 bedrooms.

You should find that this gives the same predicted price as the value you obtained using the model fit with gradient descent.


In [ ]:
X = mat(data[:, :2])
X = c_[np.ones(m), X]

theta = normal_eqn(X, y)
print('Theta computed from the normal equations:\n', theta)

house = [1650, 3]                # no feature normalization is needed with the normal equations
price = r_[1, house].dot(theta)  # prepend the intercept term x_0 = 1 before applying theta

price = price.A[0][0] # If price is a matrix, pull out the first value
print('For a house with 1650 sq-ft and 3 bedrooms, we predict a price of ${:.2f}'.format(price))